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The Dirichlet process (DP) is a fundamental mathematical tool 
for Bayesian nonparametric modeling, and is widely used in tasks 
such as density estimation, natural language processing, and time se- 
ries modeling. Although MCMC inference methods for the DP often 
^f^ provide a gold standard in terms asymptotic accuracy, they can be 

^H computationally expensive and are not obviously parallelizable. We 

^^ propose a reparameterization of the Dirichlet process that induces 

Cn conditional independencies between the atoms that form the random 

%^ measure. This conditional independence enables many of the Markov 

Mh chain transition operators for DP inference to be simulated in paral- 

-^T lei across multiple cores. Applied to mixture modeling, our approach 

enables the Dirichlet process to simultaneously learn clusters that 
describe the data and superclusters that define the granularity of 
parallelization. Unlike previous approaches, our technique does not 
require alteration of the model and leaves the true posterior distribu- 
tion invariant. It also naturally lends itself to a distributed software 
implementation in terms of Map-Reduce, which we test in cluster 
configurations of over 50 machines and 100 cores. We present experi- 
ments exploring the parallel efficiency and convergence properties of 
C^ our approach on both synthetic and real-world data, including runs 

on IMM data vectors in 256 dimensions. 
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Cs| 1. Introduction. Bayesian nonparametric models are a remarkable class of stochastic 

^ objects that enable one to define infinite dimensional random variables that have tractable 

pg finite dimensional projections. This projective property often makes it possible to construct 

,^ probabilistic models which can automatically balance simplicity and complexity in the pos- 

^ terior distribution. The Gaussian process (see, e.g., Adler and Taylor (2007); Rasmussen and 

CO Williams (2006)), Dirichlet process (Ferguson, 1973, 1974) and Indian buffet process (Grif- 

fiths and Ghahramani, 2006; Ghahramani et al., 2007) are the most common building blocks 
, ^ for Bayesian nonparametric models, and they have found uses in a wide variety of domains: 

S^ natural language models (Teh et al., 2006), computer vision (Sudderth et al., 2006), activity 

modeling (Fox et al., 2009a), among many others. 

Most commonly, Bayesian nonparametric models use the infinite dimensional construction 
to place priors on the latent parameters of the model, such as in Dirichlet process mixtures 
(Escobar and West, 1995; Rasmussen, 2000), Gaussian Cox processes (M0ller et al., 1998; 
Adams et al., 2009a), and latent feature models (Fox et al., 2009b). This approach to priors 
for latent structure is appealing as the evidence for, e.g., a particular number of components in 
a mixture, is often weak and we wish to be maximally flexible in our specification of the model. 
Unfortunately, the use of Bayesian nonparametric priors for latent structure often yields mod- 
els whose posterior distribution cannot be directly manipulated; indeed a density is often un- 
available. In practice, it is therefore common to perform approximate inference using Markov 
chain Monte Carlo (MCMC), in which posterior computations are performed via Monte Carlo 
estimates from samples. These samples are obtained via a Markov chain that leaves the pos- 
terior distribution invariant. Remarkably, MCMC moves can be simulated on practical finite 
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Fig 1: Illustration of our auxiliary variable representation applied to Dirichlet pro- 
cess mixtures, (a) to (c) show three superclusters, one per compute node, with independent 
Dirichlet process mixtures, (d) shows their hnear combination, another Dirichlet process mix- 
ture. Note that the clusters within each supercluster need not be similar for our scheme to 
deliver efficiency gains. 



computers for many Bayesian nonparametric models, despite being infinite-dimensional, e.g., 
Rasmussen (2000); Neal (2000); Walker (2007); Papaspiliopoulos and Roberts (2008); Adams 
et al. (2009b). This property arises when finite data sets recruit only a finite projection of 
the underlying infinite object. Most practical Bayesian nonparametric models of interest are 
designed with this requirement in mind. 

Markov chain Monte Carlo, however, brings with it frustrations. Chief among these is the 
perception that MCMC is computationally expensive and not scalable. This is conflated with 
the observation that the Markovian nature of such inference techniques necessarily require 
the computations to be sequential. In this paper, we challenge both of these conventional 
wisdoms for one of the most important classes of Bayesian nonparametric model, the Dirichlet 
process mixture (DPM). We take a novel approach to this problem that exploits invariance 
properties of the Dirichlet process to reparameterize the random measure in such a way that 
conditional independence is introduced between sets of atoms. These induced independencies, 
which are themselves inferred as part of the MCMC procedure, enable transition operators on 
different parts of the posterior to be simulated in parallel on different hardware, with minimal 
communication. Unlike previous parallelizing schemes such as Asuncion et al. (2008), our 
approach does not alter the prior or require an approximating target distribution. We find 
that this parallelism results in real- world gains as measured by several different metrics against 
wall-clock time. 

2. The Dirichlet Process. The Dirichlet process defines a distribution over probability 
measures in terms of a base probability measure i7 on a sample space O and a concentration 
parameter o: > 0. In its most general form, a Dirichlet process is characterized by the prop- 
erty that any finite measurable partition of a O leads to a finite Dirichlet distribution over 
the associated probability measure. That is, if O is partitioned into Ai, ^2, • • • , Am^ then the 
probability measure G(Ai)^ ^(^2), • • • , G{Am) has a finite Dirichlet distribution with param- 
eters aH{Ai), aH{A2), • • • , aH{AM)- As an alternative to this somewhat abstract definition. 
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DP probability measures can also be constructed from a stick breaking process: 



G = 2_. ^j ^Oj Uj \a r^ Beta(l, a) 
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where it is clear from this construction that the G are discrete with probability one. To achieve 
continuous density functions, the DP is often used as a part of an infinite mixture model: 

oo 

(1) F = Y.^,Fe^^ 

where Fq is a parametric family of component distributions and the base measure H is now 
interpreted as a prior on this family. This Dirichlet process mixture model (DPM) is frequently 
used for model-based clustering, in which data belonging to a single Fq are considered to form 
a group. The Dirichlet process allows for this model to possess an unbounded number of 
such clusters. This view leads to a related object called the Chinese restaurant process in 
which the ttj are integrated out and one considers the infinitely exchangeable distribution 
over groupings alone. 

3. Nesting Partitions in the Dirichlet Process. Our objective in this work is to 
construct an auxiliary- variable representation of the Dirichlet process in which 1) the clusters 
are partitioned into "superclusters" that can be separately assigned to independent com- 
pute nodes; 2) most Markov chain Monte Carlo transition operators for DPM inference can 
be performed in parallel on these nodes; and 3) the original Dirichlet process prior is kept 
intact, regardless of the distributed representation. We will assume that there are K super- 
clusters, indexed by k. We will use j G N to index the clusters uniquely across superclusters, 
with 5j G {1, 2, • • • , K} being the supercluster to which j is assigned. 

The main theoretical insight that we use to construct our auxiliary representation is that 
the marginal distribution over the mass allocation of the superclusters arises directly from 
Ferguson's definition of the Dirichlet process. That is, we can generate a random DF{a,H) 
partitioning of O in stages. First, choose vector /x on the K-dimensional simplex, i.e., fik > 
and ^j^lJik — 1- Next, draw another vector 7, also on the i^-simplex, from a Dirichlet 
distribution with base measure a^i: 

(2) 71,72,- •• ,7x ^ Dirichlet (Q:/ii,a/X2,--- ,«Mx). 

Then draw K random distributions from K independent Dirichlet processes with base mea- 
sure H and concentration parameters afik- These are then mixed together with the ^k'- 

K 

(3) Gk - DP(a/x,, H) G^Y. ^^G^' 

/c=l 

This procedure results in G ^ DP(a, i7). Note that the result of this formulation is a Dirichlet 
process in which the components have been partitioned into K superclusters such that each 
contains its own "local" Dirichlet process. 
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Fig 2: Numerical calculations regarding the impact of auxiliary variable scheme on 
sampler efficiency, (a) Plot of sampling efficiency (effective number of samples per MCMC 
iteration) from the prior as a function of the number of local-machine sweeps over the data 
per cross-machine update. The lines correspond to different concentration parameters. The 
Chinese restaurant representation was used, with ten superclusters, one thousand data and 
100,000 iterations. Note that sampler efficiency is roughly independent of super-cluster update 
ratio and increases with higher concentration parameter, indicating that more parallel gain can 
be expected the higher the number of latent clusters in the training dataset. (b) The posterior 
probability distribution on the Dirichlet concentration parameter for various configurations 
balanced mixture models. The number of clusters is varied from 128 to 2048, and the number 
of data points per cluster is varied from 1024 to 4096. As larger concentration parameters 
imply more room for parallelization in our method, this view represents the opportunity for 
parallel gains as a function of the latent structure and quantity of the data. 



Marginalizing out the sticks of each local Dirichlet process results naturally in a Chinese 
restaurant process with concentration parameter afi^- Interestingly, we can also integrate out 
the ^k to construct a two-stage Chinese restaurant variant. Each "customer" first chooses one 
of the K "restaurants" according to its popularity: 



Pr (datum n chooses supercluster k \ a) 



^f^k + Yln'=i 1(5^,/ = k) 



a + n 



This corresponds to the predictive distribution of the Dirichlet-multinomial over superclusters. 
In the second stage, the customer chooses a table Zn at their chosen restaurant k according 
to its popularity among other customers at that restaurant: 



Pr(2:^ = extant component j \ a, Sj — k) — 
VT{zn — new component j \ a, Sj — k) — 



«/ffc 



4. Markov Transition Operators for Parallelized Inference. The goal of this dis- 
tributed representation is to provide MCMC transition operators that can efficiently use mul- 
tiple cores for inference. We have observed N data {xn}n=i ^^^ we wish to simulate a Markov 



CLUSTERCLUSTER: PARALLEL MARKOV CHAIN MONTE CARLO 



Initialization 



Gibbs scan i- 



Subsample 





/ 


;* 

^ 


Map 




~CRP(a/xi) 






Uniform 




\ 


split 


\ 


\ 


~ CRP(a/Xfc) 




1 








~CRP(a/XK) 1 



Gibbs scan i 

Reduce 



Gibbs scan i+1 




Fig 3: The map-reduce dataflow for our distributed sampler. During initialization, 
the data are randomly assigned to compute nodes (superclusters). On each MCMC scan, 
independent mappers perform clustering assuming fixed hyperparameters A. These hyperpa- 
rameters, along with the concentration parameter and assignments of clusters to superclusters 
are updated in the reduce step. Finally, clusters are shuffled amongst superclusters and the 
new latent state is communicated to each worker, leaving the cluster in a suitable state for 
further iterations. The Dirichlet process is effectively learning both how to cluster the data 
and at what granularity to parallelize inference. 



chain on the associated posterior distribution. Following the standard approach for mixture 
modelling, we introduce latent variables z^ E N that identify the cluster to which datum n 
is assigned. In the Markov chain, we will also represent the cluster-specific parameters Qj 
(although some model choices allow these to be marginalized away) and the concentration 
parameter a. 

We introduce some notation for convenience when referring to counts: 

TV 
num data in s.c. k: ^j^ = > ^{^zn — ^) 

n=l 

N 



num data in cluster j: #j = > I{zn = j) 



n=l 

oo 



extant clusters in s.c. j: Jk — / I(#/c > 0, 5j 



k) 



We use these to examine the prior over the z^ and Sj\ 

Dirichlet-Multinomial 



K independent CRPs 



(4) Pr(K},{5,}|a) 
(5) 



r(a) 



T{N + a] 



K 

n 

k=i 






r(#fc + aiik) 

fid-- 



K 



n^"^'^' 



Jk . 



r(a/x^ 



.k=i 



T{afik + #/c) 



k=i 



Note that the terms cancel out so that the result is a marginal Chinese restaurant process 
multiplied by a multinomial over how the components are distributed over the superclusters. 

Updating a given the Zn'-- This operation must be centralized, but is lightweight. Each su- 
percluster communicates its number of clusters Jk and these are used to sample from the 
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conditional (assuming prior p{a)): 

(6) P{a I {^nlli) c p(a)^,^lM_aEL: J., 

This can be done with shce samphng or adaptive rejection samphng. 

Updating base measure hyperparameters:. It is often the case in Bayesian hierarchical models 
that there are parameters governing the base measure H. These are typically hyperparameters 
that determine the priors on cluster-specific parameters 9 and constrain the behavior of Fq. 
Updates to these parameters are performed in the reduce step, based on sufficient statistics 
transmitted from the map step. 

Updating 0j given Zn'-- These are model-specific updates that can be done in parallel, as 
each 6j is only asked to explain data that belong to supercluster Sj. 

Updating Zn given Sj, 9j, and a:. This is typically the most expensive MCMC update for 
Dirichlet process mixtures: the hypothesis over cluster assignments must be modified for 
each datum. However, if only local components are considered, then this update can be par- 
allelized. Moreover, as the reparameterization induces K conditonally independent Dirichlet 
processes, standard DPM techniques, such as Neal (2000), Walker (2007), or Papaspiliopoulos 
and Roberts (2008) can be used per supercluster without modification. Data cannot move to 
components on different machines (in different superclusters), but can instantiate previously- 
unseen clusters within its local superclusters in the standard way. Note that the fik scaling 
causes these new components to be instantiated with the correct probability. 

Updating Sj given Zn and a:. As data can only move between clusters that are local to their 
machine, i.e., within the same supercluster, it is necessary to move data between machines. 
One efficient strategy for this is to move entire clusters, along with their associated data to 
new superclusters. This is a centralized update, but it only requires communicating a set of 
data indices and one set of component parameters. The update itself is straightforward: Gibbs 
sampling according to the Dirichlet-multinomial, given the other assignments. In particular, 
we note that since 9j moves with the cluster, the likelihood does not participate in the com- 
putation of the transition operator. We define J/.\j to be the number of extant clusters in 
supercluster fc, ignoring cluster j. The conditional posterior update is then 

\J) Pl'i^j = ^ I {Jk'\j}k'=l^ «) = ^^k' — ^^ — - — . 

^ + l^k' = l ^k'\j 

5. Distributed implementation. The conditional independencies inherent in the Markov 
chain transition operators we have defined correspond naturally with an efficient, distributed 
implementation in terms of Map- Reduce (Dean and Ghemawat, 2008). Figure 3 describes the 
workflow. These operators, implemented as mappers and reducers, act on a distributed repre- 
sentation of the latent state, that is also based on the independencies in our auxiliary variable 
representation. Intuitively, each mapper performs MCMC updates on an independent cluster- 
ing problem (within the supercluster it corresponds to), assuming fixed hyperparameters. The 
reduce step collects the latent state together and updates hyperparameters, while the shuffie 
step broadcasts the new hyperparameters and shuffies clusters amongst the superclusters. 

Our system software implementation, described in Figure 4, is based on Python implemen- 
tations, with modest optimization (in Cython) for the most compute-intensive inner loops. 
We use the Hadoop open source framework for Map-Reduce, and perform experiments using 
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Fig 4: Software architecture for our im- 
plementation. Scaling to a million datapoints 
was achievable despite only modestly optimized 
Python implementations of transition operators. 
Typical runs involved 10 to 50 machines with 2 
to 4 cores each. Although we focused on Elastic 
Compute Cloud experiments, our architecture is 
appropriate for other distributed computing plat- 
forms. 



Fig 5: Our parallel sampler constructs ac- 
curate density estimates for many synthetic 
data sources. We generated synthetic datasets 
from finite mixture models ranging from 200,000 
up to one million datapoints and from 128 clus- 
ters up to 2048 clusters. Marker designates the 
number of clusters, colors indicate the number of 
datapoints. Data is jittered for visual clarity. 



on-demand compute clusters from Amazon's Elastic Compute Cloud. Typical configurations 
for our experiments involved 10-50 machines, each with 2-4 cores, stressing gains due to par- 
allelism were possible despite significant inter-machine communication overhead. 

We used a uniform prior over the superclusters, i.e., fik = l/K. For initialization, we perform 
a small calibration run (on 1-10% of the data) using a serial implementation of MCMC 
inference, and use this to choose the initial concentration parameter a. We then assign data 
to superclusters uniformly at random, and initialize the clustering via a draw from the prior 
using the local Chinese restaurant process. This is sufficient to roughly estimate (within an 
order of magnitude) the correct number of clusters, which supports efficient density estimation 
from the distributed implementation. 

There is considerable room for further optimization. First, if we were pushing for the largest 
achievable scale, we would use a C++ implementation of the map, reduce and shuffle opera- 
tions. Back-of-the-envelope suggestions suggest performance gains of lOOx should be feasible 
with only minimal memory hierarchy optimization. Second, we would focus on use of a small 
number of many-core machines. Third, we would use a distributed framework such as HaLoop^ 
or MPI, which would permit simultaneous, asynchronous computation and communication. 
Fourth, it is known that tuning various parameters that control Hadoop can result in sig- 
nificant performance enhancements (Herodotou and Babu, 2011). For datasets larger than 
100GB (roughly 50x larger than the datasets considered in this paper), we would want to 
distribute not just the latent state, but also the data itself, perhaps using the Hadoop File 
System (Shvachko, 2010). These advanced distributed implementation techniques are beyond 
the scope of this paper. 



^https : //code . google . com/p/haloop/ 
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Fig 6: Predictive density estimates converge quickly, while latent structure esti- 
mates converge more slowly, (top) Results for our parallel sampler on a synthetic dataset 
consisting of 2048 clusters and 200,000 datapoints. Each line represents a different number 
of compute nodes used to parallelize the sampler: either 2, 8, or 32 nodes were used. The 
purple line represents ground truth. The parallel samplers perform correctly as determined 
by eventual convergence to the true likelihood of the test set. Parallel gains are seen up for up 
to 8 compute nodes, at which point we reach saturation, (bottom) The parallel samplers also 
eventually convert to the true number of clusters, but at a much slower rate than convergence 
to the predictive log likelihood. Two runs with different inferential random seeds are shown 
for each configuration. See the main text for further discussion. 



Williamson et al. (2013), concurrently with and independently of our previous preliminary 
work (Lovell et al., 2012), investigated a related parallel MCMC method based on the same 
auxiliary variable scheme. They focus on multi-core but single-machine implementations and 
on applications to admixtures based on the hierarchical Dirichlet process (HDP) (Teh et al., 
2006). The compatibility of our transition operators with a Map- Reduce implementation en- 
ables us to analyze datasets with lOOx more dimensions than those from Williamson, even 
in the presence of significant inter-machine communication overhead. We also rely purely on 
MCMC throughout, based on initialization from our prior, avoiding the need for heuristic 
initialization based on k-means. Combined, our approaches suggest many models based on 
the DP may admit principled parallel schemes and scale to significantly larger problems than 
are typical in the literature. 

Intuitively, this scheme resembles running "restricted Gibbs" scans over subsets of the 
clusters, then shuffling clusters amongst subsets, which one might expect to yield slower 
convergence to equilibrium than full Gibbs scans. Our auxiliary variable representation shows 
this can be interpreted in terms of exact transition operations for a DPM. 

6. Experiments. We explored the accuracy of our prototype distributed implementation 
on several synthetic and real-world datasets. Our synthetic data was drawn from a balanced 
finite mixture model. Each mixture component 9j was parameterized by a set of coin weights 
drawn from a Beta(/3(i, ^d) distribution, where {^d} is a set of cluster component hyperpa- 
rameters, one per dimension d of the data. The binary data were Bernoulli draws based on 
the weight parameters {6j} of their respective clusters. Our implementation collapsed out the 
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Fig 7: Parallel efficiencies for 32 workers can be seen with IMM rows and 512 
clusters. Consistent with our numerical calculations, larger datasets with more clusters afford 
more opportunities for parallel gains. At this scale, larger than the one from Figure 6, we see 
parallel efficiencies up to 32 workers and no slowdown in latent structure convergence. 



coin weights and updated each /3^ during the reduce step using a Griddy Gibbs (Ritter and 
Tanner, 1992) kernel. 

Figure 5 shows results supporting the accuracy of our inference scheme as a density estima- 
tor given high-dimensional datasets with large numbers of clusters. We see reliable convergence 
to predictive probabilities close to the true entropy of the generating mixture. 

Figure 6 shows the convergence behavior of our sampler: predictive densities (and joint 
probabilities) typically asymptote quickly, while latent structure estimates (such as the num- 
ber of clusters, or the concentration parameter) converge far more slowly. As the Dirichlet 
process is known to not result in consistent estimates of number of clusters for finite mix- 
ture datasets (Miller and Harrison, 2013) (which have no support under the DP prior), it is 
perhaps not surprising that predictive likelihood converges more quickly than estimates of 
number of clusters - especially given our auxiliary variable representation, which may en- 
courage representations with multiple predictively equivalent clusters. It would be interesting 
to characterize the regimes where these dynamics occur, and to determine whether they are 
also present for approximate parallelization schemes or variational algorithms based on our 
auxiliary variable representation. Figure 8 shows a typical pattern of efficiencies for parallel 
computation: for a given problem size, speed increases until some saturation point is reached, 
after which additional compute nodes slow down the computation. In future work we will 
explore means of separating the components of this tradeoff due to communication costs, 
initialization, and any components due to convergence slowdown. 

Finally, in Figures 9 and 10, we show a representative run on a IMM vector subset of the 
Tiny Images (Torralba, 2008) dataset, where we use the Dirichlet process mixture to perform 
vector quantization. The input data are binary features obtained by running a randomized 
approximation to PCA on 100,000 rows and thresholding the top 256 principal components 
into binary values at their component- wise median. After one day of elapsed time and 32 CPU- 
days of computation, the sampler is still making significant progress compressing the data, 
and has converged to the vicinity of 3000 clusters. Serial MCMC (not shown) is intractable 
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Fig 8: Saturation as communication 
costs and convergence slowdown over- 
whelm per-iteration parallelism gains. 

Results on a 500,000 row problem with 1024 
clusters, including 2, 8, 32 and 128 compute 
nodes (for a max of 64 machines), show- 
ing more rapid convergence up to saturation, 
then slower convergence afterwards. 
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Fig 9: An illustration on vector quan- 
tization of a IMM subset of the Tiny 
image dataset with 32 workers. Conver- 
gence of a representative run in terms of pre- 
dictive accuracy and number of clusters. 



on this problem. 

7. Conclusion. We have introduced an auxiliary variable representation of the Dirichlet 
process and applied it to mixture models, where it yields superclusters that cluster the clus- 
ters. We have shown how this representation enables an exact parallelization of the standard 
MCMC schemes for inference, where the DP learns how to parallelize itself, despite the lack of 
conditional independencies in the traditional form of the model. We have also shown that this 
representation naturally meshes with the Map-Reduce approach to distributed computation 
on a large compute cluster, and developed a prototype distributed implementation atop Ama- 
zon's Elastic Compute Cloud, tested on over 50 machines and 100 cores. We have explored 
its performance on synthetic and real-world density estimation problems, including runs on 
over IMM row, 256-dimensional data sets. 

These results point to a potential path forward for "big data" applications of Bayesian 
statistics for models that otherwise lack apparent conditional independencies. We suspect 
searching for auxiliary variable representations that induce independencies may lead to new 
ways to scale up a range of nonparametric Bayesian models, and may perhaps also lead 
to further correspondences with established distributed computing paradigms for MCMC 
inference. We hope our results present a useful step in this direction. 
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Fig 10: (left) 9 input images from a representative cluster, versus 9 random input images, 
showing less visual coherence, (right) 100 binary feature vectors from a single inferred cluster, 
versus 100 random binary feature vectors from the dataset, showing significant compression. 
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